10 research outputs found
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale. Comment: 7 pages
Operationalizing Machine Learning: An Interview Study
Organizations rely on machine learning engineers (MLEs) to operationalize ML,
i.e., deploy and maintain ML pipelines in production. The process of
operationalizing ML, or MLOps, consists of a continual loop of (i) data
collection and labeling, (ii) experimentation to improve ML performance, (iii)
evaluation throughout a multi-staged deployment process, and (iv) monitoring of
performance drops in production. When considered together, these
responsibilities seem staggering -- how does anyone do MLOps, what are the
unaddressed challenges, and what are the implications for tool builders?
We conducted semi-structured ethnographic interviews with 18 MLEs working
across many applications, including chatbots, autonomous vehicles, and finance.
Our interviews expose three variables that govern success for a production ML
deployment: Velocity, Validation, and Versioning. We summarize common practices
for successful ML experimentation, deployment, and sustaining production
performance. Finally, we discuss interviewees' pain points and anti-patterns,
with implications for tool design. Comment: 20 pages, 4 figures
Revisiting Prompt Engineering via Declarative Crowdsourcing
Large language models (LLMs) are incredibly powerful at comprehending and
generating data in the form of text, but are brittle and error-prone. There has
been an advent of toolkits and recipes centered around so-called prompt
engineering: the process of asking an LLM to do something via a series of
prompts. However, for LLM-powered data processing workflows, in particular,
optimizing for quality, while keeping cost bounded, is a tedious, manual
process. We put forth a vision for declarative prompt engineering. We view LLMs
like crowd workers and leverage ideas from the declarative crowdsourcing
literature (including leveraging multiple prompting strategies, ensuring
internal consistency, and exploring hybrid LLM/non-LLM approaches) to make
prompt engineering a more principled process. Preliminary case studies on
sorting, entity resolution, and imputation demonstrate the promise of our
approach.
Waltzing binaries: Probing line-of-sight acceleration of merging compact objects with gravitational waves
Line-of-sight acceleration of a compact binary coalescence (CBC) event would
modulate the shape of the gravitational waves (GWs) it produces with respect to
the corresponding non-accelerated CBC. Such modulations could be indicative of
its astrophysical environment. We investigate the prospects of detecting this
acceleration in future observing runs of the LIGO-Virgo-KAGRA network, as well
as in next-generation (XG) detectors and the proposed DECIGO. We place the
first observational constraints on this acceleration, for putative binary
neutron star mergers GW170817 and GW190425. We find no evidence of
line-of-sight acceleration in these events at confidence. Prospective
constraints for the fifth observing run of the LIGO at A+ sensitivity suggest
that accelerations for typical BNSs could be constrained with a precision of
, assuming a signal-to-noise ratio of .
These improve to in XG detectors, and in DECIGO. We also interpret these constraints
in the context of mergers around supermassive black holes. Comment: Accepted to Ap
Decibel: the relational dataset branching system
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. National Science Foundation (U.S.) (1513972); National Science Foundation (U.S.) (1513407); National Science Foundation (U.S.) (1513443); Intel Science and Technology Center for Big Data
Influence of plasma modification on mechanical and thermal properties of Polypropylene/ Nano-Calcium Silicate Composites
The aim of the research is to study the influence of plasma modification on nano calcium silicate/polypropylene composites. Polypropylene (PP) is considered for this study because of its high impact strength, toughness and availability. Calcium silicate is considered as reinforcement because of its high temperature resistance, high flexural strength and high strength-to-mass ratio. Fourier transform infrared spectroscopy (FTIR) results show a change in the functional groups on the surface of calcium silicate after modification. Thermogravimetric analysis (TGA) and differential scanning calorimetry (DSC) results show that the decomposition temperature increases with increasing amounts of filler particles. The modification also produces a marginal increase in the decomposition and glass transition temperatures. Tensile test results show a gradual increase in the tensile properties of the composites at higher reinforcement ratios, and a marginal increase in tensile strength when the composite is reinforced with modified rather than non-modified calcium silicate. Scanning electron microscopy (SEM) reveals enhanced dispersion of the nanoparticles after modification. Based on these findings, it can be concluded that plasma modification marginally enhances the thermal and mechanical properties.
Collaborative data analytics with DataHub
While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.